942 research outputs found

    A log-ratio biplot approach for exploring genetic relatedness based on identity by state

    The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores that allele sharing data across individuals in reality has a higher dimensionality, and it also disregards the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are especially useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied in an iterative manner, acting as a looking glass for more remote relationships that are harder to classify. Datasets from the 1,000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with the classical plots in a simulation study. In a non-inbred homogeneous population the classification rate of the log-ratio principal component approach outperforms the classical graphics across the whole allele frequency spectrum, using only identity by state. In these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis, combined with discriminant analysis, can correctly classify relationships up to and including the fourth degree.

    Los toros y la Fiesta en Aracena


    Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data

    Background: During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails difficulties in managing high-dimensional data, handling class imbalance during knowledge extraction, identifying important features and classifying individuals. In this study, we propose a general framework to tackle these challenges with different machine learning algorithms and techniques. We apply a configuration of this framework to lung cancer patients, identifying genetic signatures for classifying response to drug treatment. We intersect these relevant SNPs with the GWAS Catalog of the National Human Genome Research Institute and explore the RegulomeDB and GTEx databases for functional analysis purposes. Results: The machine learning based solution proposed in this study is a scalable and flexible alternative to the classical univariate regression approach for analyzing large-scale data. From 36 experiments executed using the machine learning framework design, we obtain good classification performance from the top 5 models with the highest cross-validation score and the smallest standard deviation. A total of 1,224 SNPs corresponding to the key features from the top 20 models (cross-validation F1 mean >= 0.65) were compared with the GWAS Catalog, finding no intersection with genome-wide significant reported hits. From these, new genetic signatures in MAE, CEP104, PRKCZ and ADRB2 show relevant biological regulatory functionality related to lung physiology. Conclusions: We have defined a machine learning framework using a large, unbalanced data set of SNP-array and imputed genotyping data from a pharmacogenomics study in lung cancer patients subjected to first-line platinum-based treatment. This approach found genome signals with no genome-wide significance in the univariate regression approach (GWAS Catalog) that are valuable for classifying patients, only a few of them with related biological function. The effects of these variants can be explained by the recently proposed omnigenic model hypothesis, which states that complex traits can be influenced not only by the “core genes”, mainly found by the genome-wide significant SNPs, but also by the rest of the genes outside of the “core pathways”, with apparently unrelated biological functionality.

    An Autosomal-Recessive Form of Cutis Laxa Is Due to Homozygous Elastin Mutations, and the Phenotype May Be Modified by a Heterozygous Fibulin 5 Polymorphism

    Cutis laxa (CL) is a heterogeneous group of connective tissue disorders characterized by loose, sagging skin and variable involvement of other organs. Autosomal-dominant forms are relatively mild, and may be caused by mutations in the elastin gene, whereas the more severe recessive forms have been associated with mutations in the fibulin 4 and fibulin 5 genes, as well as in a vesicular ATPase subunit. We describe here a previously unreported autosomal-recessive form of CL caused by homozygous recessive mutations in exon 12 of the elastin gene (p.P211S) in three patients from two related consanguineous Syrian families. Furthermore, we found that the presence of a polymorphism in the fibulin 5 gene in one of the patients seems to modify the phenotype, producing more severe symptoms. This polymorphism (p.L301M) was associated with mild symptoms in the mother of the patient, who was heterozygous for both the elastin and fibulin 5 mutations. To our knowledge, autosomal-recessive CL owing to homozygous mutations in the elastin gene has not been reported previously.

    Disease networks identify specific conditions and pleiotropy influencing multimorbidity in the general population

    Multimorbidity is an emerging topic in public health policy because of its increasing prevalence and socio-economic impact. However, the age- and gender-dependent trends of disease associations at fine resolution, and the underlying genetic factors, remain incompletely understood. Here, by analyzing disease networks from electronic medical records of primary health care, we identify key conditions and shared genetic factors influencing multimorbidity. Three types of diseases are outlined: "central" diseases, which include chronic and non-chronic conditions and have higher cumulative risks of disease associations; "community roots", which have lower cumulative risks but inform on continuing clustered disease associations with age; and "seeds of bursts", most of which are chronic and which reveal outbreaks of disease associations leading to multimorbidity. The diseases with a major impact on multimorbidity are caused by genes that occupy central positions in the network of human disease genes. Alteration of lipid metabolism connects breast cancer, diabetic neuropathy and nutritional anemia. Evaluation of key disease associations by a genome-wide association study identifies shared genetic factors and further supports causal commonalities between nervous system diseases and nutritional anemias. This study also reveals many shared genetic signals with other diseases. Collectively, our results depict novel population-based multimorbidity patterns, identify key diseases within them, and highlight pleiotropy influencing multimorbidity.

    Impacts of Saharan dust intrusions on bacterial communities of the low troposphere

    We have analyzed the bacterial community of a large Saharan dust event in the Iberian Peninsula and, for the first time, we offer new insights regarding the bacterial distribution at different altitudes of the lower troposphere and the replacement of the microbial airborne structure as the dust event recedes. Samples from different open-air altitudes (surface, 100 m and 3 km) were obtained onboard the National Institute for Aerospace Technology (INTA) C-212 aircraft. Samples were collected during dust and dust-free air masses, as well as two weeks after the dust event. Samples related in height or time scale seem to show more similar community composition patterns compared with unrelated samples. The most abundant bacterial species during the dust event were grouped in three different phyla: (a) Proteobacteria: Rhizobiales, Sphingomonadales, Rhodobacterales; (b) Actinobacteria: Geodermatophilaceae; (c) Firmicutes: Bacillaceae. Most of these taxa are well known for being extremely stress-resistant. After the dust intrusion, Rhizobium was the most abundant genus (40–90% of total sequences). Samples taken during the flights carried out 15 days after the dust event were much more similar to the dust event samples compared with the remaining samples. In this case, Brevundimonas and Methylobacterium, as well as Cupriavidus and Mesorhizobium, were the most abundant genera.

    GCAT|Genomes for life: a prospective cohort study of the genomes of Catalonia

    PURPOSE: The prevalence of chronic non-communicable diseases (NCDs) is increasing worldwide. NCDs are the leading cause of both morbidity and mortality, and it is estimated that by 2030, they will be responsible for 80% of deaths across the world. The Genomes for Life (GCAT) project is a long-term prospective cohort study that was designed to integrate and assess the role of epidemiological, genomic and epigenomic factors in the development of major chronic diseases in Catalonia, a north-eastern region of Spain. PARTICIPANTS: By the end of 2017, the GCAT Study will have recruited 20 000 participants aged 40-65 years. Participants who agreed to take part in the study completed a self-administered computer-driven questionnaire, and underwent blood pressure, cardiac frequency and anthropometry measurements. For each participant, blood plasma, blood serum and white blood cells are collected at baseline. The GCAT Study has access to the electronic health records of the Catalan Public Healthcare System. Participants will be followed biannually for at least 20 years after recruitment. FINDINGS TO DATE: Among all GCAT participants, 59.2% are women and 83.3% of the cohort identified themselves as Caucasian/white. More than half of the participants have higher education levels, 72.2% are current workers and 42.1% are classified as overweight (body mass index ≥25 and <30 kg/m²). We have genotyped 5459 participants, of whom 5000 have metabolome data. Further, the whole genome of 808 participants will be sequenced by the end of 2017. FUTURE PLANS: The first follow-up study started in December 2017 and will end by March 2018. Residences of all subjects will be geocoded during the following year. Several genomic analyses are ongoing, and metabolomic and genomic integrations will be performed to identify underlying genetic variants, as well as environmental factors that influence metabolites.